Overview

This analysis explores a wheat dataset containing yield information across different countries, locations, and time periods. The primary objectives are:

  • Perform comprehensive data cleaning and quality assessment
  • Create interactive visualizations to understand geographic and temporal patterns
  • Assess data completeness and identify missing value patterns
  • Evaluate the dataset’s suitability for machine learning applications using quantitative metrics

Load Required Libraries

  library(data.table)    # Fast file reading
  library(tidyverse)     # Data manipulation (dplyr, tidyr, forcats) and ggplot2
  library(knitr)         # Report tables (kable)
  library(ggplot2)       # Attached with tidyverse; loaded explicitly for clarity
  library(scales)        # Number and colour-scale formatting
  library(leaflet)       # Interactive maps
  library(gt)            # Formatted summary tables
  library(kableExtra)    # Extended kable styling
  library(plotly)        # Interactive heatmaps and plots
  library(furrr)         # Parallel mapping functions

Data Loading and Initial Exploration

# Load wheat data with irrigation and rainfall calendar information
# Note: fileEncoding handles special characters in country/location names
data <- read.csv("../../../data/Data/wheat_data_with_calendar.csv", fileEncoding = "latin1")

# Initial data inspection
cat("Dataset dimensions:", dim(data), "\n")
## Dataset dimensions: 80382 86
cat("Number of countries:", length(unique(data$Country)), "\n")
## Number of countries: 42
# Note: Observation.period is read in as character and contains blank entries,
# so this raw range is unreliable; the column is cleaned below
cat("Time period covered:", range(data$Observation.period, na.rm = TRUE), "\n")
## Time period covered:  2022
# Display all column names for reference
names(data)
##  [1] "id"                                   
##  [2] "Data.ID"                              
##  [3] "Location"                             
##  [4] "State.Region.County.Province"         
##  [5] "Country"                              
##  [6] "Continent"                            
##  [7] "Latitude..N.S."                       
##  [8] "Longitude..E.W."                      
##  [9] "Conversion.for.latitude"              
## [10] "Conversion.for.longitude"             
## [11] "Location.source"                      
## [12] "Observation.period"                   
## [13] "Wheat.Type"                           
## [14] "Crop.variety"                         
## [15] "Tillage.type"                         
## [16] "Planting.date"                        
## [17] "Harvesting.date"                      
## [18] "Flowering.stage"                      
## [19] "Treatment"                            
## [20] "Treatment.type"                       
## [21] "Grain.yield..tons.ha.1."              
## [22] "SE...22"                              
## [23] "Climate"                              
## [24] "Mean.annual.temperature..Â.C."        
## [25] "Water.regime"                         
## [26] "Mean.annual.precipitation..mm."       
## [27] "Irrigation..mm."                      
## [28] "Soil.type"                            
## [29] "Soil.depth..cm."                      
## [30] "Sand...."                             
## [31] "Silt...."                             
## [32] "Clay...."                             
## [33] "Soil.texture"                         
## [34] "Soil.organic.carbon..g.C.kg.1."       
## [35] "Soil.organic.carbon...."              
## [36] "Category"                             
## [37] "TN..g.N.kg.1."                        
## [38] "Category.1"                           
## [39] "C.N.ratio"                            
## [40] "Category.2"                           
## [41] "Soil.pH"                              
## [42] "Category.3"                           
## [43] "BD..g.cm.3."                          
## [44] "Category.4"                           
## [45] "Replicates"                           
## [46] "N.type"                               
## [47] "Category.5"                           
## [48] "N.rate..kg.N.ha.1."                   
## [49] "N.fertilizer.management"              
## [50] "P.type"                               
## [51] "P.rate..kg.P.ha.1."                   
## [52] "Straw.return"                         
## [53] "Plastic.film.mulching"                
## [54] "Emissions..yes.no."                   
## [55] "Cumulative.N2O.fluxes..kg.N.ha.1."    
## [56] "SE...56"                              
## [57] "SD...57"                              
## [58] "Yield.scaled.N2O.emission..g.N.Mg.1." 
## [59] "SE...59"                              
## [60] "SD...60"                              
## [61] "PFPN..kg.kg..1."                      
## [62] "ANE..kg.kg.1."                        
## [63] "EFd...."                              
## [64] "Pest.prescence....64"                 
## [65] "Pest.detected...65"                   
## [66] "Pest.severity.score.......66"         
## [67] "Pest.abundance...67"                  
## [68] "Pest.prescence....68"                 
## [69] "Pest.detected...69"                   
## [70] "Pest.severity.score.......70"         
## [71] "Pest.abundance...71"                  
## [72] "Pest.prescence....72"                 
## [73] "Main.weed.detected"                   
## [74] "Total.weed.coverage"                  
## [75] "Main.weed.abundance.of.total.coverage"
## [76] "X"                                    
## [77] "Y"                                    
## [78] "unit_code"                            
## [79] "rainfed_start_date"                   
## [80] "rainfed_end_date"                     
## [81] "irr_start_date"                       
## [82] "irr_end_date"                         
## [83] "start_date"                           
## [84] "end_date"                             
## [85] "Planting.date.1"                      
## [86] "Harvesting.date.1"

Data Cleaning and Preprocessing

Column Selection and Type Conversion

# Select key variables and rename for easier handling
yield <- data %>%
  select(Country, 
         Observation.period, 
         Location, 
         Continent, 
         State.Region.County.Province, 
         longitude = Conversion.for.longitude,    # Rename for clarity
         latitude = Conversion.for.latitude,      # Rename for clarity
         yield_value = Grain.yield..tons.ha.1.) %>%  # Main outcome variable
  mutate(
    # Convert character columns to appropriate numeric types
    longitude = as.numeric(longitude),
    latitude = as.numeric(latitude),
    yield_value = as.numeric(yield_value),
    Observation.period = as.numeric(Observation.period)
  ) %>%  
  filter(
    # Remove problematic observations
    !is.na(yield_value),                    # Remove missing yield values
    !is.na(longitude),                      # Remove missing coordinates
    !is.na(latitude),
    yield_value <= 30,                      # Remove unrealistic yield outliers (>30 tons/ha)
    nchar(as.character(Observation.period)) == 4  # Keep only 4-digit years
  )

cat("Cleaned dataset dimensions:", dim(yield), "\n")
## Cleaned dataset dimensions: 79710 8
cat("Yield range:", range(yield$yield_value), "tons/ha\n")
## Yield range: 0.01673634 28.5 tons/ha

Data Quality Summary

# Summary statistics for the cleaned dataset
summary(yield)
##    Country          Observation.period   Location          Continent        
##  Length:79710       Min.   :1951       Length:79710       Length:79710      
##  Class :character   1st Qu.:2014       Class :character   Class :character  
##  Mode  :character   Median :2017       Mode  :character   Mode  :character  
##                     Mean   :2013                                            
##                     3rd Qu.:2018                                            
##                     Max.   :2022                                            
##  State.Region.County.Province   longitude           latitude     
##  Length:79710                 Min.   :-121.926   Min.   :-37.82  
##  Class :character             1st Qu.:-120.110   1st Qu.: 27.33  
##  Mode  :character             Median :-109.090   Median : 36.34  
##                               Mean   : -70.241   Mean   : 32.11  
##                               3rd Qu.:  -0.373   3rd Qu.: 38.54  
##                               Max.   : 152.390   Max.   : 60.70  
##   yield_value      
##  Min.   : 0.01674  
##  1st Qu.: 3.85785  
##  Median : 5.43659  
##  Mean   : 5.38656  
##  3rd Qu.: 6.88948  
##  Max.   :28.50000

Geographic Visualization

Interactive Leaflet Map

# Create color palette for yield values
pal <- colorNumeric(palette = "YlOrRd", domain = yield$yield_value)

# Create interactive map showing global wheat yield distribution
leaflet(yield) %>%
  addTiles() %>%  # Add base map tiles
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    color = ~pal(yield_value),  # Color by yield value
    popup = ~paste("Country:", Country, "<br>",
                   "Location:", Location, "<br>",
                   "Yield:", round(yield_value, 2), "tons/ha"),
    radius = 3,
    fillOpacity = 0.7
  ) %>%
  addLegend("bottomright", 
            pal = pal,
            values = ~yield_value, 
            title = "Yield (tons/ha)")

Map Features:

  • Color coding: yellow-to-red gradient representing yield intensity
  • Interactive popups: click a marker for country, location, and yield details
  • Legend: helps interpret yield values across locations

Temporal Analysis

Distribution of Observations Over Time

# Histogram showing data collection patterns over years
ggplot(yield, aes(x = Observation.period)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Wheat Yield Observations Over Time", 
       x = "Year", 
       y = "Number of Observations") +
  theme(plot.title = element_text(size = 14, face = "bold"))

Yearly Summary Statistics

# Summary of observations and yield by year
yearly_summary <- yield %>%
  group_by(Observation.period) %>%
  summarise(
    n_observations = n(),
    mean_yield = round(mean(yield_value), 2),
    median_yield = round(median(yield_value), 2),
    n_countries = n_distinct(Country),
    .groups = 'drop'
  ) %>%
  arrange(desc(Observation.period))

# Display recent years
head(yearly_summary, 10) %>%
  kable(caption = "Recent Years: Observations and Yield Statistics")
Recent Years: Observations and Yield Statistics
Observation.period n_observations mean_yield median_yield n_countries
2022 141 6.52 6.74 2
2021 252 5.36 6.21 3
2020 1646 4.96 5.00 6
2019 5776 4.56 4.47 21
2018 20115 5.81 6.00 21
2017 18942 5.38 5.55 18
2016 2335 4.33 3.72 12
2015 5943 4.74 4.66 15
2014 5804 5.21 5.25 14
2013 5152 5.39 6.08 10

Geographic Distribution Analysis

Observations by Country and Continent

# Bar plot showing observations by country, colored by continent
ggplot(yield, aes(x = fct_infreq(Country), fill = Continent)) +
  geom_bar() +
  theme_minimal() +
  labs(title = "Number of Wheat Yield Observations by Country and Continent", 
       x = "Country", 
       y = "Number of Observations") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 14, face = "bold")) +
  scale_fill_viridis_d(option = "plasma")

Country-Continent Summary Table

# Create organized summary table by country and continent
country_continent_table <- yield %>%
  group_by(Country, Continent) %>%
  summarise(
    count = n(),
    mean_yield = round(mean(yield_value), 2),
    .groups = 'drop'
  ) %>%
  arrange(desc(count))

# Display as formatted table
country_continent_table %>%
  head(15) %>%
  gt() %>%
  tab_header(title = "Top 15 Countries by Number of Observations") %>%
  cols_label(
    Country = "Country",
    Continent = "Continent", 
    count = "Observations",
    mean_yield = "Mean Yield (tons/ha)"
  ) %>%
  fmt_number(columns = mean_yield, decimals = 2)
Top 15 Countries by Number of Observations
Country Continent Observations Mean Yield (tons/ha)
United States of America North America 39915 5.78
Mexico North America 14294 5.17
United Kingdom Europe 7985 5.06
Kenya Africa 7053 4.99
Ethiopia Africa 3909 3.46
China Asia 1239 5.59
Germany Europe 1093 7.26
Turkey Asia 862 5.66
Iran Asia 392 5.26
Morocco Africa 384 5.35
Uzbekistan Asia 276 4.00
Algeria Africa 232 2.58
Russia Asia 230 4.28
Pakistan Asia 164 4.28
Ukraine Europe 160 4.31

Missing Data Analysis

Missing Data Heatmap by Country

# Prepare data for missing value heatmap
missing_heatmap_data <- data %>%
  group_by(Country) %>%
  summarise(across(everything(), ~mean(is.na(.)) * 100), .groups = 'drop') %>%
  pivot_longer(cols = -Country, 
               names_to = "Variable", 
               values_to = "Percent_Missing")
# Create heatmap visualization
ggplot(missing_heatmap_data, aes(x = Variable, y = Country, fill = Percent_Missing)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red", name = "% Missing") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 14, face = "bold")) +
  labs(title = "Missing Data Heatmap by Country", 
       x = "Variable", 
       y = "Country")

Interactive Heatmap by Country

# Percentage of missing values per variable, by country
pct_missing_all <- data %>%
  group_by(Country) %>%
  summarise(
    # Compute across() first so the new Total_Obs column is not swept into it
    across(everything(), ~round(mean(is.na(.)) * 100, 2), .names = "Pct_NA_{.col}"),
    Total_Obs = n()
  ) %>%
  arrange(desc(Total_Obs))

# Reshape the wide missing-value summary into long format for plotting
heatmap_data <- pct_missing_all %>%
  select(-Total_Obs) %>%
  pivot_longer(cols = -Country, 
               names_to = "Variable", 
               values_to = "Pct_Missing",
               names_prefix = "Pct_NA_")

# Build the interactive heatmap from the long-format data
p <- plot_ly(
  data = heatmap_data,
  x = ~Variable,
  y = ~Country,
  z = ~Pct_Missing,
  type = "heatmap",
  colorscale = "RdYlBu",
  reversescale = TRUE,
  hovertemplate = "Country: %{y}<br>Variable: %{x}<br>Missing: %{z}%<extra></extra>"
) %>%
  layout(
    title = "Missing Data Heatmap",
    xaxis = list(title = "Variable"),
    yaxis = list(title = "Country")
  )

p

Machine Learning Readiness Assessment

ML Readiness Quantification Methodology

The ML readiness assessment uses multiple quantitative criteria to evaluate data sufficiency for machine learning applications.

Key Metrics Calculated:

  1. Sample Size (n_observations): Total number of yield observations
  2. Temporal Coverage (year_span): Number of years from first to last observation
  3. Temporal Uniqueness (n_unique_years): Number of distinct years with data
  4. Data Density (data_density): Observations per year (n_observations ÷ year_span)
  5. Yield Variability (yield_sd): Standard deviation indicating data diversity
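
These metrics can be illustrated on a toy example. The sketch below uses hypothetical numbers (not drawn from the dataset) to show how each quantity is derived for a single country, with year_span defined as last year minus first year plus one, matching the country-level assessment later in this report:

```r
library(dplyr)

# Hypothetical observations for one country (toy values, for illustration only)
toy <- tibble(
  Observation.period = c(2010, 2010, 2012, 2015, 2015, 2020),
  yield_value        = c(3.1, 4.0, 3.6, 5.2, 4.8, 5.9)
)

toy %>%
  summarise(
    n_observations    = n(),                                    # 6
    n_unique_years    = n_distinct(Observation.period),         # 4
    year_span         = max(Observation.period) -
                          min(Observation.period) + 1,          # 11
    data_density      = round(n_observations / year_span, 2),   # 6 / 11 = 0.55
    temporal_coverage = round(n_unique_years / year_span, 2),   # 4 / 11 = 0.36
    yield_sd          = round(sd(yield_value), 2)               # spread of yields
  )
```

In the country-level assessment, the same quantities are computed per country via group_by(Country).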

ML Suitability Classification System:

The classification uses evidence-based thresholds derived from ML best practices:

# Create a detailed breakdown of ML readiness criteria
ml_criteria <- data.frame(
  Category = c("Excellent", "Good", "Moderate", "Limited", "Insufficient"),
  Min_Observations = c("≥20", "≥15", "≥10", "≥5", "<5"),
  Min_Year_Span = c("≥10", "≥7", "≥5", "≥3", "<3"),
  Min_Temporal_Coverage = c("≥60%", "≥50%", "≥40%", "Any", "Any"),
  Typical_Use_Cases = c(
    "Complex models, cross-validation, feature engineering",
    "Standard ML algorithms, basic validation",
    "Simple models, limited validation",
    "Exploratory analysis, trend detection", 
    "Insufficient for reliable ML"
  ),
  Recommended_Methods = c(
    "Random Forest, SVM, Neural Networks, Ensemble",
    "Linear/Logistic Regression, Decision Trees",
    "Simple Linear Models, Correlation Analysis",
    "Descriptive Statistics, Basic Trends",
    "Data Collection Priority"
  )
)

ml_criteria %>%
  gt() %>%
  tab_header(title = "ML Readiness Classification Criteria and Recommendations") %>%
  cols_label(
    Category = "ML Suitability",
    Min_Observations = "Min. Observations",
    Min_Year_Span = "Min. Year Span",
    Min_Temporal_Coverage = "Min. Temporal Coverage",
    Typical_Use_Cases = "Typical Use Cases",
    Recommended_Methods = "Recommended Methods"
  ) %>%
  data_color(
    columns = Category,
    colors = scales::col_factor(
      palette = c("red", "orange", "yellow", "lightgreen", "green"),
      domain = c("Insufficient", "Limited", "Moderate", "Good", "Excellent")
    )
  ) %>%
  cols_width(
    Category ~ px(100),
    Min_Observations ~ px(120),
    Min_Year_Span ~ px(120),
    Min_Temporal_Coverage ~ px(140),
    Typical_Use_Cases ~ px(200),
    Recommended_Methods ~ px(200)
  ) %>%
  tab_style(
    style = cell_text(size = px(11)),
    locations = cells_body()
  )
ML Readiness Classification Criteria and Recommendations
ML Suitability Min. Observations Min. Year Span Min. Temporal Coverage Typical Use Cases Recommended Methods
Excellent ≥20 ≥10 ≥60% Complex models, cross-validation, feature engineering Random Forest, SVM, Neural Networks, Ensemble
Good ≥15 ≥7 ≥50% Standard ML algorithms, basic validation Linear/Logistic Regression, Decision Trees
Moderate ≥10 ≥5 ≥40% Simple models, limited validation Simple Linear Models, Correlation Analysis
Limited ≥5 ≥3 Any Exploratory analysis, trend detection Descriptive Statistics, Basic Trends
Insufficient <5 <3 Any Insufficient for reliable ML Data Collection Priority

Country-Level Data Assessment

# Assess which countries have sufficient data for ML modeling
modeling_readiness <- yield %>%
  group_by(Country) %>%
  summarise(
    n_observations = n(),
    n_unique_years = n_distinct(Observation.period),
    year_span = max(Observation.period) - min(Observation.period) + 1,
    first_year = min(Observation.period),
    last_year = max(Observation.period),
    avg_yield = round(mean(yield_value), 2),
    yield_sd = round(sd(yield_value), 2),
    .groups = 'drop'
  ) %>%
  mutate(
    # Calculate data density (observations per year span)
    data_density = round(n_observations / year_span, 2),
    
    # Calculate temporal coverage ratio (unique years / total span)
    temporal_coverage = round(n_unique_years / year_span, 2),
    
    # Calculate coefficient of variation for yield diversity
    cv_yield = round((yield_sd / avg_yield) * 100, 1),
    
    # Classify suitability for ML based on multiple criteria
    ml_suitability = case_when(
      n_observations >= 20 & year_span >= 10 & temporal_coverage >= 0.6 ~ "Excellent",
      n_observations >= 15 & year_span >= 7 & temporal_coverage >= 0.5 ~ "Good", 
      n_observations >= 10 & year_span >= 5 & temporal_coverage >= 0.4 ~ "Moderate",
      n_observations >= 5 & year_span >= 3 ~ "Limited",
      TRUE ~ "Insufficient"
    ),
    
    # Add specific ML recommendations
    recommended_approach = case_when(
      ml_suitability == "Excellent" ~ "Advanced ML: Ensemble methods, feature engineering",
      ml_suitability == "Good" ~ "Standard ML: Random Forest, SVM, cross-validation",
      ml_suitability == "Moderate" ~ "Simple ML: Linear regression, basic validation",
      ml_suitability == "Limited" ~ "Exploratory: Trend analysis, correlation studies",
      TRUE ~ "Focus on data collection before ML attempts"
    )
  ) %>%
  arrange(desc(n_observations))

# Display comprehensive ML readiness assessment
modeling_readiness %>%
  head(15) %>%
  gt() %>%
  tab_header(title = "Comprehensive Machine Learning Readiness Assessment") %>%
  cols_label(
    Country = "Country",
    n_observations = "Total Obs",
    n_unique_years = "Unique Years", 
    year_span = "Year Span",
    first_year = "First Year",
    last_year = "Last Year",
    temporal_coverage = "Temporal Coverage",
    data_density = "Obs/Year",
    cv_yield = "Yield CV (%)",
    avg_yield = "Mean Yield",
    yield_sd = "Yield SD",
    ml_suitability = "ML Suitability",
    recommended_approach = "Recommended Approach"
  ) %>%
  data_color(
    columns = ml_suitability,
    colors = scales::col_factor(
      palette = c("red", "orange", "yellow", "lightgreen", "green"),
      domain = c("Insufficient", "Limited", "Moderate", "Good", "Excellent")
    )
  ) %>%
  data_color(
    columns = temporal_coverage,
    colors = scales::col_numeric(
      palette = c("red", "yellow", "green"),
      domain = c(0, 1)
    )
  ) %>%
  fmt_number(columns = c(temporal_coverage, data_density), decimals = 2) %>%
  fmt_number(columns = c(avg_yield), decimals = 1) %>%
  cols_width(
    Country ~ px(100),
    recommended_approach ~ px(250)
  ) %>%
  tab_style(
    style = cell_text(size = px(10)),
    locations = cells_body(columns = recommended_approach)
  )
Comprehensive Machine Learning Readiness Assessment
Country Total Obs Unique Years Year Span First Year Last Year Mean Yield Yield SD Obs/Year Temporal Coverage Yield CV (%) ML Suitability Recommended Approach
United States of America 39915 41 41 1981 2021 5.8 2.08 973.54 1.00 36.0 Excellent Advanced ML: Ensemble methods, feature engineering
Mexico 14294 69 69 1951 2019 5.2 1.30 207.16 1.00 25.1 Excellent Advanced ML: Ensemble methods, feature engineering
United Kingdom 7985 55 55 1968 2022 5.1 2.42 145.18 1.00 47.8 Excellent Advanced ML: Ensemble methods, feature engineering
Kenya 7053 4 4 2017 2020 5.0 2.47 1,763.25 1.00 49.5 Limited Exploratory: Trend analysis, correlation studies
Ethiopia 3909 5 6 2014 2019 3.5 1.69 651.50 0.83 48.8 Moderate Simple ML: Linear regression, basic validation
China 1239 20 20 2003 2022 5.6 2.19 61.95 1.00 39.2 Excellent Advanced ML: Ensemble methods, feature engineering
Germany 1093 28 36 1981 2016 7.3 1.75 30.36 0.78 24.1 Excellent Advanced ML: Ensemble methods, feature engineering
Turkey 862 3 3 2017 2019 5.7 2.19 287.33 1.00 38.7 Limited Exploratory: Trend analysis, correlation studies
Iran 392 3 3 2017 2019 5.3 1.78 130.67 1.00 33.8 Limited Exploratory: Trend analysis, correlation studies
Morocco 384 2 2 2014 2015 5.3 1.92 192.00 1.00 35.9 Insufficient Focus on data collection before ML attempts
Uzbekistan 276 3 3 2017 2019 4.0 1.49 92.00 1.00 37.2 Limited Exploratory: Trend analysis, correlation studies
Algeria 232 2 3 2015 2017 2.6 1.02 77.33 0.67 39.5 Limited Exploratory: Trend analysis, correlation studies
Russia 230 2 2 2018 2019 4.3 1.03 115.00 1.00 24.1 Insufficient Focus on data collection before ML attempts
Pakistan 164 7 9 2010 2018 4.3 1.19 18.22 0.78 27.8 Good Standard ML: Random Forest, SVM, cross-validation
Ukraine 160 3 3 2017 2019 4.3 1.32 53.33 1.00 30.6 Limited Exploratory: Trend analysis, correlation studies

Advanced ML Readiness Scoring

# Create a composite ML readiness score (0-100)
modeling_readiness_scored <- modeling_readiness %>%
  mutate(
    # Sample size score (0-30 points)
    sample_score = pmin(30, (n_observations / 50) * 30),
    
    # Temporal coverage score (0-25 points)  
    temporal_score = pmin(25, temporal_coverage * 25),
    
    # Data density score (0-20 points)
    density_score = pmin(20, (data_density / 2) * 20),
    
    # Year span score (0-15 points)
    span_score = pmin(15, (year_span / 20) * 15),
    
    # Yield variability score (0-10 points) - rewards moderate variability
    variability_score = case_when(
      cv_yield >= 10 & cv_yield <= 30 ~ 10,  # Optimal variability
      cv_yield >= 5 & cv_yield < 10 ~ 7,     # Low variability
      cv_yield > 30 & cv_yield <= 50 ~ 7,    # High variability
      cv_yield > 50 ~ 3,                     # Very high variability
      TRUE ~ 5                               # Very low variability
    ),
    
    # Total ML readiness score
    ml_score = round(sample_score + temporal_score + density_score + span_score + variability_score, 1),
    
    # Score-based categories
    score_category = case_when(
      ml_score >= 80 ~ "Excellent (≥80)",
      ml_score >= 65 ~ "Good (65-79)",
      ml_score >= 45 ~ "Moderate (45-64)", 
      ml_score >= 25 ~ "Limited (25-44)",
      TRUE ~ "Insufficient (<25)"
    )
  ) %>%
  arrange(desc(ml_score))

# Display top countries by ML score
modeling_readiness_scored %>%
  select(Country, n_observations, year_span, temporal_coverage, 
         cv_yield, ml_score, score_category) %>%
  head(15) %>%
  gt() %>%
  tab_header(title = "Top 15 Countries by ML Readiness Score") %>%
  cols_label(
    Country = "Country",
    n_observations = "Observations",
    year_span = "Year Span", 
    temporal_coverage = "Temporal Coverage",
    cv_yield = "Yield CV (%)",
    ml_score = "ML Score",
    score_category = "Score Category"
  ) %>%
  data_color(
    columns = ml_score,
    colors = scales::col_numeric(
      palette = c("red", "orange", "yellow", "lightgreen", "green"),
      domain = c(0, 100)
    )
  ) %>%
  fmt_number(columns = c(temporal_coverage), decimals = 2) %>%
  fmt_number(columns = c(cv_yield, ml_score), decimals = 1)
Top 15 Countries by ML Readiness Score
Country Observations Year Span Temporal Coverage Yield CV (%) ML Score Score Category
Mexico 14294 69 1.00 25.1 100.0 Excellent (≥80)
United States of America 39915 41 1.00 36.0 97.0 Excellent (≥80)
United Kingdom 7985 55 1.00 47.8 97.0 Excellent (≥80)
China 1239 20 1.00 39.2 97.0 Excellent (≥80)
Germany 1093 36 0.78 24.1 94.5 Excellent (≥80)
Russia 230 2 1.00 24.1 86.5 Excellent (≥80)
Bulgaria 120 2 1.00 16.1 86.5 Excellent (≥80)
Croatia 116 2 1.00 19.1 86.5 Excellent (≥80)
Pakistan 164 9 0.78 27.8 86.2 Excellent (≥80)
Serbia 80 1 1.00 11.7 85.8 Excellent (≥80)
Brazil 64 6 0.83 23.0 85.2 Excellent (≥80)
Kenya 7053 4 1.00 49.5 85.0 Excellent (≥80)
India 137 19 0.53 30.6 84.5 Excellent (≥80)
Turkey 862 3 1.00 38.7 84.2 Excellent (≥80)
Iran 392 3 1.00 33.8 84.2 Excellent (≥80)

ML Readiness Distribution Analysis

# Analyze distribution of ML readiness scores
score_distribution <- modeling_readiness_scored %>%
  count(score_category) %>%
  mutate(
    percentage = round(n / sum(n) * 100, 1),
    score_category = factor(score_category, 
                           levels = c("Insufficient (<25)", "Limited (25-44)", 
                                    "Moderate (45-64)", "Good (65-79)", "Excellent (≥80)"))
  )

# Visualization of score distribution
ggplot(score_distribution, aes(x = score_category, y = n, fill = score_category)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = paste0(n, "\n(", percentage, "%)")), 
            vjust = 0.5, fontface = "bold", size = 3.5) +
  scale_fill_manual(values = c("red", "orange", "yellow", "lightgreen", "green")) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    plot.title = element_text(size = 14, face = "bold")
  ) +
  labs(
    title = "Distribution of Countries by ML Readiness Score",
    x = "ML Readiness Category", 
    y = "Number of Countries",
    caption = "Higher scores indicate better suitability for machine learning applications"
  )

ML Suitability Summary

# Summary of ML suitability across all countries
ml_summary <- modeling_readiness %>%
  count(ml_suitability) %>%
  mutate(percentage = round(n / sum(n) * 100, 1))

ggplot(ml_summary, aes(x = reorder(ml_suitability, n), y = n, fill = ml_suitability)) +
  geom_col() +
  geom_text(aes(label = paste0(n, " (", percentage, "%)")), hjust = -0.1) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Distribution of Countries by ML Suitability",
       x = "ML Suitability Category",
       y = "Number of Countries") +
  scale_fill_manual(values = c("Insufficient" = "red", "Limited" = "orange", 
                               "Moderate" = "yellow", "Good" = "lightgreen", 
                               "Excellent" = "green")) +
  theme(legend.position = "none",
        plot.title = element_text(size = 14, face = "bold"))

Country-Level Summary for ML Planning

# Summarise temporal coverage and overall yield direction by country
yield_summary <- yield %>%
  group_by(Country) %>%
  summarise(
    n_years = n_distinct(Observation.period),
    year_range = paste(min(Observation.period), "-", max(Observation.period)),
    avg_yield = round(mean(yield_value), 2),
    # cor() is undefined for countries with a single year of data
    yield_trend = case_when(
      n_distinct(Observation.period) < 2 ~ "Single year",
      cor(Observation.period, yield_value) > 0 ~ "Increasing",
      TRUE ~ "Decreasing"
    ),
    .groups = 'drop'
  ) %>%
  arrange(desc(n_years))

# Display summary with enhanced formatting
yield_summary %>%
  head(15) %>%
  gt() %>%
  tab_header(title = "Country Summary: Temporal Coverage and Yield Patterns") %>%
  cols_label(
    Country = "Country",
    n_years = "Unique Years",
    year_range = "Year Range", 
    avg_yield = "Average Yield (tons/ha)",
    yield_trend = "Overall Trend"
  ) %>%
  data_color(
    columns = yield_trend,
    colors = scales::col_factor(
      palette = c("lightcoral", "lightgreen"),
      domain = c("Decreasing", "Increasing")
    )
  )
Country Summary: Temporal Coverage and Yield Patterns
Country Unique Years Year Range Average Yield (tons/ha) Overall Trend
Mexico 69 1951 - 2019 5.17 Decreasing
United Kingdom 55 1968 - 2022 5.06 Increasing
United States of America 41 1981 - 2021 5.78 Decreasing
Germany 28 1981 - 2016 7.26 Decreasing
China 20 2003 - 2022 5.59 Increasing
India 10 1997 - 2015 4.18 Decreasing
Canada 9 2002 - 2018 3.09 Decreasing
Pakistan 7 2010 - 2018 4.28 Increasing
Australia 6 1997 - 2014 4.16 Increasing
Poland 6 2005 - 2017 6.38 Increasing
Spain 6 2011 - 2018 4.85 Decreasing
Brazil 5 2011 - 2016 2.96 Decreasing
Estonia 5 2004 - 2009 5.32 Increasing
Ethiopia 5 2014 - 2019 3.46 Decreasing
Nepal 5 2009 - 2014 2.42 Decreasing

Key Findings and Recommendations

Data Quality Summary

excellent_countries <- modeling_readiness %>% 
  filter(ml_suitability == "Excellent") %>% 
  nrow()

good_countries <- modeling_readiness %>% 
  filter(ml_suitability %in% c("Good", "Excellent")) %>% 
  nrow()

total_countries <- nrow(modeling_readiness)
total_observations <- nrow(yield)

  1. Geographic Coverage: The dataset covers 40 countries across 6 continents
  2. Temporal Coverage: Data spans from 1951 to 2022
  3. Data Volume: 79,710 clean observations after quality filtering
  4. Yield Range: From 0.017 to 28.5 tons/ha

Machine Learning Readiness Summary

  • Countries suitable for ML: 9 out of 40 countries (22.5%) have “Good” or “Excellent” data quality
  • High-priority countries: 5 countries have “Excellent” ratings for advanced ML applications
  • Score-based ranking: Composite scores (0-100) provide precise prioritization for ML efforts

Next Steps and Recommendations

  1. Priority Data Collection: Focus on countries with “Limited” or “Insufficient” ratings
  2. ML Pipeline Development: Start with “Excellent” rated countries for proof-of-concept
  3. Feature Engineering: Incorporate climate, soil, and socioeconomic variables
  4. Temporal Modeling: Develop country-specific time series forecasting models
  5. Spatial Analysis: Implement geographic clustering and regional modeling

Data Enhancement Opportunities

  • Environmental Variables: Temperature, precipitation, soil quality data
  • Agricultural Practices: Fertilizer use, irrigation methods, crop rotation
  • Economic Factors: Input costs, market prices, policy influences
  • Technology Adoption: Mechanization levels, seed varieties, precision agriculture

This comprehensive analysis provides both categorical and quantitative assessments of data quality, enabling informed decisions about machine learning applications and data collection priorities for wheat yield prediction.